Page Frame Detection for Marginal Noise Removal from Scanned Documents
Abstract
We describe and evaluate a method to robustly detect the page frame in document images, locating the actual page contents area and removing textual and non-textual noise along the page borders. We use a geometric matching algorithm to find the optimal page frame, which has the advantages of not assuming the existence of whitespace between noisy borders and actual page contents, and of giving a practical solution to the page frame detection problem without the need for parameter tuning. We define suitable performance measures and evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. In addition, we demonstrate that the use of page frame detection reduces the OCR error rate by removing textual noise. Experiments using a commercial OCR system show that the error rate due to elements outside the page frame is reduced from 4.3% to 1.7% on the UW-III dataset.
Similar resources
Border Noise Removal of Camera-Captured Document Images Using Page Frame Detection
Camera-captured document images usually contain two main types of marginal noise: textual noise (coming from neighboring pages) and non-textual noise (resulting from the page surroundings and/or the binarization process). These types of marginal noise degrade the performance of the preprocessing (dewarping) of camera-captured document images and subsequent document digitization/recognition processes...
Marginal Noise Removal of Document Images
Marginal noise is a common phenomenon in document analysis which results from the scanning of thick documents or skewed documents. It usually appears in the form of a large and dark region around the margin of document images. Marginal noise might cover meaningful document objects, such as text, graphics and forms. The overlapping of marginal noise with meaningful objects makes it difficult to per...
Skew Detection Using the Radon Transform
In an automatic document conversion system, which builds digital documents from scanned articles, there is a need to perform various adjustments before the scanned image is fed to the OCR system. This is because the OCR system is prone to error when the text is not properly identified, aligned, de-noised, etc. One such adjustment is the detection of page skew, an unintentional rotation of the ...
Salt and Pepper Noise Removal using Pixon-based Segmentation and Adaptive Median Filter
Removing salt and pepper noise is an active research area in image processing. In this paper, a two-phase method is proposed for removing salt and pepper noise while preserving edges and fine details. In the first phase, noise candidate pixels, which are likely to be contaminated by noise, are detected. In the second phase, only the noise candidate pixels are restored using an adaptive median filter. In...
Local Thresholding Algorithm Based on Variable Window Size Statistics
In an automatic document conversion system, which builds digital documents from scanned articles, there is a need to perform various adjustments before the scanned image is fed to the layout analysis system. This is because the layout detection system is sensitive to errors when the page elements are not properly identified, represented, denoised, etc. One such adjustment is the detection of for...